
Poster: Nemo_bis Date: Mar 11, 2014 4:15am
Forum: web Subject: Google's robots.txt rules interpreted too strictly by Wayback Machine

*/!forum/books says "Page cannot be crawled or displayed due to robots.txt".
However, their robots.txt contains
Allow: /a/
and says
# robots.txt for Google Groups. See this URL for documentation on robots.txt:
# Note in particular that "the most specific rule based on the length of the
# [path] entry will trump the less specific (shorter) rule."
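To illustrate the precedence rule quoted in that comment, here is a minimal Python sketch of longest-match evaluation. It is not how Heritrix or the Wayback Machine actually evaluates rules. The function name `is_allowed`, the `Disallow: /` line, and the sample paths are hypothetical and chosen only for illustration; only `Allow: /a/` comes from the robots.txt quoted above. Wildcards and user-agent groups are ignored.

```python
def is_allowed(path, rules):
    """Return True if `path` may be crawled under `rules`.

    `rules` is a list of (directive, path_prefix) tuples, for example
    ("Allow", "/a/"). The longest matching prefix wins; on a tie in
    length, Allow wins (an assumption of this sketch).
    """
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        # An empty prefix matches nothing in this sketch.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# "Disallow: /" is a hypothetical rule; "Allow: /a/" is the one quoted above.
rules = [("Disallow", "/"), ("Allow", "/a/")]

print(is_allowed("/a/example", rules))  # longer Allow prefix matches
print(is_allowed("/b/example", rules))  # only the Disallow prefix matches
```

Under this reading, a path below /a/ stays crawlable even when a shorter Disallow rule also matches it, which is the behaviour the post expects from the Wayback Machine.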

Edit: There are a few bug reports against Heritrix about this sort of rule, and most of them are closed; I don't know where the problem lies.

This post was modified by Nemo_bis on 2013-11-11 07:28:37

This post was modified by Nemo_bis on 2014-03-11 11:15:46